Abstract

A  Blue  airborne  force  attacks  a  region  defended  by  a  single  Red  surface-to-air 
missile  system  (SAM).  Red  is  uncertain  about  the  Blues  he  faces,  but  is  able  to 
learn  about  them  during  the  engagement.  Red’s  objective  is  to  develop  a  policy 
for  shooting  at  the  Blues  to  maximise  the  value  of  Blues  shot  down  before  he 
himself  is  destroyed.  We  show  that  index  policies  are  optimal  for  Red  in  a  range 
of  scenarios  and  yield  effective  heuristics  more  generally.  The  quality  of  such 
index  heuristics  is  confirmed  in  a  computational  study. 

1.  Introduction  and  Basic  Scenario 

The  following  scenario  is  a  simplified  version  of  one  occurring  when  a  Blue  airborne 
force attacks a Red region defended by a Red missile system; see Barkdoll et al. (2001).

A single Red surface-to-air missile system (SAM), hereafter simply Red, can attack and
be attacked by a collection of N Blue airborne attackers, labelled 1 through N. Blues
come in B types, but Red has only imperfect information concerning the nature of the
Blues he is facing. Red is able to construct N (independent) prior distributions
$\Pi^1, \Pi^2, \ldots, \Pi^N$ which summarise his beliefs about the type identities of the
Blues before any shooting starts. Hence $\Pi^j_b$ is the probability that Red assigns to
the event "Blue number j is of type b", $1 \le j \le N$, $1 \le b \le B$, in advance of
action. At each time t = 0, 1, 2, ... Red shoots at a single Blue and that Blue retaliates
by firing back on Red. Red has a (constant) probability $r_b$ of destroying a type b Blue
with a single shot, and has (constant) probability $\theta_b$ of being destroyed by a
retaliatory strike from a type b Blue. Red knows when a Blue has been destroyed because
no retaliatory strike follows. All shooting outcomes are assumed to be independent of
each other. If Red destroys a type b Blue with his shot then he receives
a reward $V_b\alpha^t$, where $V_b$ is the utility associated with this occurrence and
$\alpha \in [0,1]$ is a discount rate. Red's goal is to maximise the expected utility of
Blues destroyed prior to his own destruction.

A  crucial  feature  of  the  model  concerns  Red’s  capacity  to  update  his  beliefs  about  the 
Blues  he  is  facing  in  the  light  of  the  outcomes  of  past  engagements,  by  using  Bayes’ 
Theorem. In particular, if Blue target j has been involved in n engagements and he and
Red have survived them all (note that this is the only event of interest for future
decision-making) then the prior $\Pi^j$ becomes the posterior $\Pi^{j,n}$ given by

$$\Pi^{j,n}_b = \frac{\Pi^j_b (1-r_b)^n (1-\theta_b)^n}{\sum_{c=1}^{B} \Pi^j_c (1-r_c)^n (1-\theta_c)^n}, \qquad 1 \le b \le B. \qquad (1)$$



Hence,  Red’s  beliefs  about  the  Blues  are  evolving  and  this  will  plainly  impact  his 
shooting  decisions. 
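
The update in (1) is simple to compute. The following is a minimal sketch only: the function name `posterior` and the numerical values in the usage line are hypothetical and are not taken from the report.

```python
from typing import List, Sequence


def posterior(prior: Sequence[float], r: Sequence[float],
              theta: Sequence[float], n: int) -> List[float]:
    """Posterior over Blue types after n engagements survived by both sides.

    Each prior weight is multiplied by the probability that a Blue of that
    type and Red both survive n exchanges, then renormalised, as in (1).
    """
    weights = [p * ((1.0 - rb) * (1.0 - tb)) ** n
               for p, rb, tb in zip(prior, r, theta)]
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("surviving n engagements has zero probability")
    return [w / total for w in weights]


# Hypothetical two-type example: type 1 is easy to kill and fairly harmless,
# type 2 is hard to kill and dangerous.
print(posterior([0.5, 0.5], r=[0.8, 0.2], theta=[0.1, 0.5], n=2))
```

In this illustrative setting a Blue that survives two exchanges becomes more likely to be of the second type, because mutual survival is more probable against that type.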


2.  An  Index  Result  for  a  Class  of  Generalised  Bandit  Problems 

The above problem will be analysed by means of a result due to Nash (1980), in a
contribution that developed the classical index result of Gittins and Jones (1974). Nash
envisages N "bandits", the jth of which is in state $X_j(t)$ at time t. A decision-maker
chooses one of the bandits to process at each time t = 0, 1, 2, .... The effect of choosing
bandit j at time t is as follows:

(i) bandit j experiences a Markovian change of state $X_j(t) \to X_j(t+1)$. Bandits not
chosen remain fixed;

(ii) a reward $\alpha^t \left[\prod_{i \ne j} q_i\{X_i(t)\}\right] R_j\{X_j(t)\}$ is generated.

The novelty of this model concerned the multiplicatively separable reward structure
in (ii) above. Here all bandits make a contribution to the rewards generated when j is
chosen through the so-called influence functions $q_i$. The q's and R's are non-negative
and bounded.
If some policy v is used for choosing bandits, with v(t) used for the choice made at t,
then the total return under Nash's model can be written

$$E_v\left[\sum_{t=0}^{\infty} \alpha^t \Big[\prod_{i \ne v(t)} q_i\{X_i(t)\}\Big] R_{v(t)}\{X_{v(t)}(t)\}\right]. \qquad (2)$$

The goal is to choose v to maximise the return in (2).

Nash was able to show that, under certain conditions (which are satisfied in all of the
problems discussed here), this problem has an index solution of the following character:
at each time t, compute a calibrating index

$$G_1\{X_1(t)\},\; G_2\{X_2(t)\},\; \ldots,\; G_N\{X_N(t)\}$$

for  each  bandit  in  its  current  state.  An  optimal  policy  will  always  choose  that  one  of  the 
bandits  with  the  largest  index.  It  does  not  matter  how  ties  are  broken. 

We can deploy Nash's model to solve our problem as follows: the bandits correspond
to the N Blues. The state $X_j(t)$ of Blue j at time t has three components, labelled
$\Pi^j(t)$, $I_R^j(t)$ and $I_B^j(t)$. Here $\Pi^j(t)$ is the posterior distribution for Blue j
describing Red's current beliefs about it; see (1). Both $I_R^j(t)$ and $I_B^j(t)$ are
indicator functions as follows:

$$I_R^j(t) = \begin{cases} 0, & \text{if by time } t,\ j \text{ has destroyed Red},\\ 1, & \text{otherwise}, \end{cases}$$

and

$$I_B^j(t) = \begin{cases} 0, & \text{if by time } t,\ j \text{ has been destroyed by Red},\\ 1, & \text{otherwise}. \end{cases}$$


To deploy Nash's model for our problem we make the following choices for each j:

$$q_j\{X_j(t)\} = I_R^j(t), \qquad R_j\{X_j(t)\} = 0 \ \text{ whenever } I_R^j(t) = 0 \text{ or } I_B^j(t) = 0. \qquad (3)$$

Otherwise $R_j$ records a single return when Blue j is destroyed.

The effect of the choices in (3), when placed within Nash's reward structure, is
(a) to wipe out any further returns following Red's destruction by any Blue j; and
(b) to wipe out further returns from Blue j following its own destruction.


This  is  precisely  what  we  want.  The  total  return  in  (2)  is  now  exactly  the  expected  utility 
of  Blues  destroyed  until  Red’s  own  destruction. 

We return now to Nash's general model, but we shall exploit the fact that our q
functions (from (3)) all have starting values 1, which remain there until a possible
transition to 0. This simplifies the index structure considerably. Consider Blue j in some
state x for which $q_j(x) = 1$. We shall describe the index $G_j(x)$ which is used in
determining the optimal policy. Imagine Red shooting at Blue j (from initial state x at
t = 0) until some positive-valued stopping time $\tau$, defined with respect to Blue's
evolving state. Define the reward rate $G_j(x, \tau)$ earned up to $\tau$ by

$$G_j(x,\tau) = \frac{E\left[\sum_{t=0}^{\tau-1} \alpha^t R_j\{X_j(t)\} \,\middle|\, X_j(0) = x\right]}{1 - E\left[\alpha^{\tau} q_j\{X_j(\tau)\} \,\middle|\, X_j(0) = x\right]}. \qquad (4)$$

The index $G_j(x)$ is the largest such reward rate, namely

$$G_j(x) = \sup_{\tau > 0} G_j(x, \tau). \qquad (5)$$

In  the  next  section  we  show  how  to  develop  indices  for  the  problem  in  Section  1.  A 
general  methodology  for  index  computation  for  Nash’s  model  may  be  found  in 
Glazebrook  and  Greatrix  (1995).  Other  discussions  of  Nash’s  model  are  found  in  Fay  and 
Glazebrook  (1987),  Glazebrook  and  Owen  (1991),  Glazebrook  and  Greatrix  (1993),  and 
Glazebrook  (1993). 


3.  Indices  for  the  Blues 

The  problem  in  Section  1  may  be  formulated  as  a  Bayes  sequential  decision  problem 
(in which the expected reward is taken with respect to the prior distributions $\Pi^j$,
$1 \le j \le N$, as well as over the realisations of the engagement) whose structure
conforms to Nash's generalised bandit, as outlined in Section 2. Hence, all we have to do
is specify the indices $G_j$ which determine optimal policies for Red. In discussing this we can
concentrate  on  individual  Blues,  and  hence,  drop  the  Blue  identifier  j. 
Consider a Blue target whose associated prior is $\Pi$ and which has had n engagements
with Red, which have left both of them intact ($I_R = I_B = 1$). Refer to this state as
$(\Pi, n)$. For the purposes of Red's decision making it is only such Blues and such
states which are of interest. In Red's next engagement with this Blue, three things can
happen: (1) Blue is destroyed and Red not; (2) Red is destroyed and Blue not; and
(3) neither is destroyed. In the formulation as a Bayesian sequential decision problem we
use the posterior in (1) to develop the probabilities of these three events as

(1) $p[\text{Blue destroyed and Red not}] = \sum_b \Pi_b (1-r_b)^n (1-\theta_b)^n r_b \,/\, D(\Pi, n)$,

(3) $p[\text{neither destroyed}] = \sum_b \Pi_b (1-r_b)^{n+1} (1-\theta_b)^{n+1} \,/\, D(\Pi, n)$,

(2) $p[\text{Red is destroyed and Blue not}] = 1 - p[\text{Blue destroyed and Red not}] - p[\text{neither destroyed}]$.

In (1) and (3) we take $D(\Pi, n) = \sum_b \Pi_b (1-r_b)^n (1-\theta_b)^n$.
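
The three event probabilities can be evaluated directly from the prior and the engagement count. The sketch below is illustrative only: the function name and the dictionary keys are hypothetical, and it assumes the order of play described in Section 1 (Red shoots first, a surviving Blue then retaliates).

```python
from typing import Dict, Sequence


def engagement_probs(prior: Sequence[float], r: Sequence[float],
                     theta: Sequence[float], n: int) -> Dict[str, float]:
    """Outcome probabilities for the next engagement from state (prior, n)."""
    # Unnormalised posterior weights; their sum is the normaliser D(Pi, n).
    weights = [p * ((1.0 - rb) * (1.0 - tb)) ** n
               for p, rb, tb in zip(prior, r, theta)]
    d = sum(weights)
    blue = sum(w * rb for w, rb in zip(weights, r)) / d
    neither = sum(w * (1.0 - rb) * (1.0 - tb)
                  for w, rb, tb in zip(weights, r, theta)) / d
    return {"blue_destroyed": blue,
            "neither": neither,
            "red_destroyed": 1.0 - blue - neither}


print(engagement_probs([0.5, 0.5], r=[0.8, 0.2], theta=[0.1, 0.5], n=0))
```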

Further, the expected return for Red from the next engagement is given by

$$\sum_b \Pi_b (1-r_b)^n (1-\theta_b)^n V_b r_b \,/\, D(\Pi, n).$$


Now, in following the prescription for computing the index at the end of Section 2 we
only need (for theoretical reasons) to consider certain kinds of stopping time $\tau$ in our
determination of the index $G(\Pi, n)$ of the Blue under discussion. Specify a positive
integer r ($\ge 1$). We write $\tau_r$ for Red's stopping time in which, from time 0 (at which
point the state of the Blue is assumed to be $(\Pi, n)$), Red has r further engagements with
Blue unless one or other of them is destroyed first. The random variable $\tau_r$ is the
number of shots from Red that results from this, and cannot exceed r or be less than one.
The expected reward up to $\tau_r$, which is required for the numerator in (4), may be
expressed as

$$\frac{\sum_b \Pi_b (1-r_b)^n (1-\theta_b)^n V_b r_b A_{1b}}{D(\Pi, n)} \qquad (6)$$

and the expression in the denominator (recall that q is just $I_R$) is

$$\frac{\sum_b \Pi_b (1-r_b)^n (1-\theta_b)^n \{1 - r_b A_{1b} - A_{2b}\}}{D(\Pi, n)}. \qquad (7)$$

From (5), (6) and (7) the index $G(\Pi, n)$ may be developed as

$$G(\Pi, n) = \max_{r \ge 1} \frac{\sum_b \Pi_b (1-r_b)^n (1-\theta_b)^n V_b r_b A_{1b}}{\sum_b \Pi_b (1-r_b)^n (1-\theta_b)^n \{1 - r_b A_{1b} - A_{2b}\}} \qquad (8)$$

where

$$A_{1b} = \sum_{s=0}^{r-1} \alpha^{s+1} (1-r_b)^s (1-\theta_b)^s$$

and

$$A_{2b} = \alpha^r (1-r_b)^r (1-\theta_b)^r.$$

We  can  now  implement  an  optimal  policy.  If  Red  is  still  alive,  then  he  computes  all  the 
indices  for  the  still  live  Blues  and  engages  next  whichever  live  Blue  has  the  largest  index. 
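
The policy just described can be sketched in code. This is a hedged illustration, not the report's implementation. It assumes the index for horizon r is a ratio whose numerator sums $w_b V_b r_b A_{1b}$ and whose denominator sums $w_b (1 - r_b A_{1b} - A_{2b})$ over types, where $w_b = \Pi_b (1-r_b)^n (1-\theta_b)^n$, $A_{1b} = \sum_{s<r} \alpha^{s+1}[(1-r_b)(1-\theta_b)]^s$ and $A_{2b} = \alpha^r[(1-r_b)(1-\theta_b)]^r$. The names `index_G` and `choose_target` and the truncation parameter `r_max` are hypothetical; the truncation replaces the maximum over all $r \ge 1$ by a finite search.

```python
import math
from typing import List, Sequence, Tuple


def index_G(prior: Sequence[float], r: Sequence[float], theta: Sequence[float],
            V: Sequence[float], n: int, alpha: float = 0.9,
            r_max: int = 200) -> float:
    """Largest reward rate over horizons 1..r_max (assumed form, see lead-in)."""
    x = [(1.0 - rb) * (1.0 - tb) for rb, tb in zip(r, theta)]
    weights = [p * xb ** n for p, xb in zip(prior, x)]
    a1 = [0.0] * len(x)   # running A_1b for the current horizon
    a2 = [1.0] * len(x)   # running (alpha * x_b)^horizon, i.e. A_2b
    best = 0.0
    for _ in range(r_max):
        for b in range(len(x)):
            a1[b] += alpha * a2[b]   # adds the term alpha^(s+1) * x_b^s
            a2[b] *= alpha * x[b]
        num = sum(w * vb * rb * a for w, vb, rb, a in zip(weights, V, r, a1))
        den = sum(w * (1.0 - rb * a - c)
                  for w, rb, a, c in zip(weights, r, a1, a2))
        if den <= 0.0:
            return math.inf   # Red cannot be destroyed: unbounded reward rate
        best = max(best, num / den)
    return best


def choose_target(blues: List[Tuple[Sequence[float], int]], r: Sequence[float],
                  theta: Sequence[float], V: Sequence[float],
                  alpha: float = 0.9) -> int:
    """Position of the live Blue (prior, n) with the largest index."""
    indices = [index_G(p, r, theta, V, n, alpha) for p, n in blues]
    return max(range(len(blues)), key=indices.__getitem__)


# Two live Blues with no engagements yet; the first is probably the vulnerable type.
live = [([0.9, 0.1], 0), ([0.1, 0.9], 0)]
print(choose_target(live, r=[0.9, 0.1], theta=[0.1, 0.9], V=[1.0, 1.0]))
```

With a single Blue type the reward rate does not depend on the horizon, which gives a convenient sanity check on any implementation.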

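In order to understand index structure, introduce the so-called "one-step index"
$H(\Pi, n)$ obtained by taking r = 1 in (8) as

$$H(\Pi, n) = \frac{\alpha \sum_b \Pi_b (1-r_b)^n (1-\theta_b)^n V_b r_b}{\sum_b \Pi_b (1-r_b)^n (1-\theta_b)^n \{(1-\alpha) + \alpha\theta_b(1-r_b)\}}. \qquad (9)$$
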


It  is  straightforward  to  establish  the  following: 

(i) If $H(\Pi, n)$ is decreasing in n, then the maximum in (8) is attained at r = 1 for all
n and it then follows that $G(\Pi, n) = H(\Pi, n)$ for all n. If this behaviour holds
good for all Blues then the index policy is quasi-myopic (a one-step look-ahead
rule). Here indices are always decreasing, and so in an optimal policy, which
always targets the Blue with the largest index, Red will switch his targeting of
the Blues frequently.

(ii) If $H(\Pi, n)$ is increasing in n, then the maximum in (8) is attained for all n in the
limit as $r \to \infty$. Here $G(\Pi, n)$ can be shown to be increasing in n. If this
behaviour holds good for all Blues then Red will, in an optimal policy, persist in
targeting individual Blues in turn until each is destroyed;

(iii) If there are just two Blue types (B = 2), then $H(\Pi, n)$ will be either increasing or
decreasing in n;

(iv) $H(\Pi, n)$ may be thought of (somewhat crudely) as a weighted average (with
respect to the posterior distribution) of a vulnerability index

$$V_b r_b / \{(1-\alpha) + \alpha\theta_b(1-r_b)\}$$

for Blues of type b. This vulnerability index is high when $V_b$ and $r_b$ are large and
when $\theta_b$ is small. It is plainly such Blues that Red would like to shoot at. In fact,
$H(\Pi, n)$ takes expectations for the numerator and denominator of the
vulnerability index separately. The index formula in (8) tells Red exactly how
to choose.
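
Points (i) to (iv) can be explored numerically. The sketch below evaluates the one-step index (9) and the vulnerability index for a hypothetical two-type example; the function names and parameter values are illustrative assumptions, not values from the report.

```python
from typing import Sequence


def one_step_index(prior: Sequence[float], r: Sequence[float],
                   theta: Sequence[float], V: Sequence[float],
                   n: int, alpha: float) -> float:
    """One-step index H(Pi, n), as in (9)."""
    num = den = 0.0
    for p, rb, tb, vb in zip(prior, r, theta, V):
        w = p * ((1.0 - rb) * (1.0 - tb)) ** n   # unnormalised posterior weight
        num += alpha * w * vb * rb
        den += w * ((1.0 - alpha) + alpha * tb * (1.0 - rb))
    return num / den


def vulnerability(rb: float, tb: float, vb: float, alpha: float) -> float:
    """Vulnerability index of a single Blue type."""
    return vb * rb / ((1.0 - alpha) + alpha * tb * (1.0 - rb))


# Hypothetical example with B = 2: H should be monotone in n, as in (iii).
r, theta, V, alpha = [0.8, 0.2], [0.1, 0.5], [1.0, 1.0], 0.9
values = [one_step_index([0.5, 0.5], r, theta, V, n, alpha) for n in range(6)]
print(values)
```

In this example survival favours the less vulnerable type, so the sequence decreases and case (i) applies: the one-step index is then the full index.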

A  variety  of  extensions  to  the  above  are  available  from  standard  index  theory.  Two  are, 
perhaps,  worthy  of  mention: 

(a)  When  new  Blues  arrive  for  engagement  in  a  Poisson  fashion,  an  index  policy  is 
still  optimal.  The  index  in  (8)  is  not  always  quite  the  right  one,  but  will  do  very 
well  in  practice.  See  Fay  and  Glazebrook  (1992); 

(b)  If  there  are  several  identical  Red  shooters  operating  in  parallel,  instead  of  just 
one,  and  the  Red  objective  is  to  maximise  the  utility  from  destroying  Blues  until 
all  Reds  are  destroyed,  then  the  above  index  policy  (operated  in  the  obvious  way) 
will  do  very  well,  but  will  not,  in  general,  be  strictly  optimal.  See  Glazebrook 
and  Garbe  (1998). 

4.  Some  Major  Extensions 

We  elaborate  the  scenario  in  Section  1  by  supposing  that  Red  could  be  one  of  several 
(R)  Red  types  and  each  Blue  has  at  his  disposal  several  weapons,  some  of  which  may  be 
designed  for  use  against  particular  Red  types.  In  this  situation,  each  Blue  will  seek  to 
learn  about  what  kind  of  Red  type  he  faces  as  well  as  vice-versa.  We  shall  assume  that 
the individual Blues can only learn about Red independently of each other; they cannot
pool information. We shall consider a range of approaches, in increasing levels of
complexity. Note that there are minor variants of most of the following proposals and
most of the objects described can be j (i.e., target) dependent.

(a)  Blue’s  strategy  known  to  Red 

The simplest option is to suppose that each Blue type b has a strategy for
choosing successive weapons in the face of inconclusive engagements and that
these strategies are known to Red. Hence, for each Blue type b, there is a
sequence $\{W_b(n), n \ge 1\}$ of weapons to be used. Note that we do not actually
require that all Blues of type b have the same strategy; that is just here for
simplicity. An index policy is still optimal and the indices concerned involve
minor adjustments to (8). We write

$$\bar\theta(b, n) = \prod_{m=1}^{n} \left[1 - \theta_{W_b(m)}\right] \qquad (10)$$

where $\theta_W$ is the kill probability for weapon W. The index for this situation may
be shown to be

$$G(\Pi, n) = \max_{r \ge 1} \frac{\sum_b \Pi_b (1-r_b)^n V_b B_{1b}}{\sum_b \Pi_b (1-r_b)^n \{\bar\theta(b, n) - B_{1b} - B_{2b}\}} \qquad (11)$$

where

$$B_{1b} = \sum_{s=0}^{r-1} \alpha^{s+1} r_b (1-r_b)^s \bar\theta(b, n+s)$$

and

$$B_{2b} = \alpha^r (1-r_b)^r \bar\theta(b, n+r).$$

(b) Blue's beliefs known to Red

This is, in fact, a simple example of (a) in which Blue's strategies
$W_b = \{W_b(n), n \ge 1\}$ are developed as Blue type b updates his prior beliefs $P^b$
about the Red type he faces. This notation presupposes that all Blues of the same type
will have the same priors, but this is not an essential feature. If Red has access both
to the $P^b$'s and also to how Blue is using his posterior beliefs to choose successive
weapons, then he has access to Blue's strategy and a suitable form of the index in
(11) can be used.

(c)  Blue’s  beliefs  not  known  to  Red 

The approaches in (b) will yield an optimal index policy whose return
$F(P^1, \ldots, P^B)$ will depend upon the priors $P^b$ describing Blue's initial beliefs
about Red. How do we proceed if we drop the assumption that Red knows the
$P^b$'s? The two classical decision-theoretic approaches are:

(1)  Suppose  Red  is  minimax 

Here Red acts conservatively and chooses the best (i.e., index) policy for the
"least favourable" priors. For most reasonable models, this will amount to
Red supposing that all Blues know what kind of Red type he is and
calculating indices accordingly.

(2)  Suppose  Red  is  Bayes 

Here Red expresses his beliefs about the unknown $P^b$'s via appropriate prior
distributions $\phi_b(p)$. We are putting priors on priors, each of the latter being
an R-dimensional probability vector p. Indices can now be developed as
follows:

For each b we have

$$p \to \text{weapon sequence} \to \bar\theta(b, n, p), \quad n \ge 1,$$

extending (10). The index (11) now is developed to become

$$G(\Pi, n) = \max_{r \ge 1} \frac{\sum_b \Pi_b \int (1-r_b)^n V_b C_{1b}(p)\, \phi_b(p)\, dp}{\sum_b \Pi_b \int (1-r_b)^n \{\bar\theta(b, n, p) - C_{1b}(p) - C_{2b}(p)\}\, \phi_b(p)\, dp} \qquad (12)$$

where

$$C_{1b}(p) = \sum_{s=0}^{r-1} \alpha^{s+1} r_b (1-r_b)^s \bar\theta(b, n+s, p)$$

and

$$C_{2b}(p) = \alpha^r (1-r_b)^r \bar\theta(b, n+r, p)$$


and such indices determine the optimal policy for the Bayesian Red. In this
formulation, Red can make inferences about Blue's evolving beliefs about
what kind of Red he is. For example, if R = 5 and a Blue type possesses 5
weapons, each one potent against one of the 5 different Red types and
ineffectual against the others, then after 4 inconclusive engagements, Red
will understand that such a Blue type now almost certainly has a clear view
of what kind of Red he is and that such a Blue's next retaliation could well
be fatal for him. The index in (12) will reflect these developing beliefs.

An  assumption  that  Blues  can  pool  their  information  about  Red  will  induce  stochastic 
dependence  among  the  Blues.  Appropriately  developed  indices  can  do  well  but  will  not 
be strictly optimal. See, for example, Boys, Glazebrook and McCrone (1996).
Appendix A


Issues  for  the  Blue  force 

The  scenario  is  as  in  Section  1  of  the  main  report  where  the  primary  focus  is  on  Red’s 
decision-making.  However,  the  controller  of  the  Blue  force  also  faces  some  issues.  A 
natural  first  question  for  Blue  concerns  what  force  he  needs  to  deploy  in  order  to  destroy 
an  optimally  shooting  Red  with  a  given  large  probability,  0.95,  say.  This,  in  fact,  turns 
out to be straightforward to assess. Suppose that $N_b$ type b Blues are deployed,
$1 \le b \le B$. The probability of Red's ultimate survival (having destroyed all Blues)
does not depend upon his strategy for engaging them. Hence, we may as well suppose that
Red engages each Blue in a continuous fight until one or the other is destroyed. In such
an engagement it is easy to show that

$$p[\text{Blue of type } b \text{ is destroyed}] = r_b / \{r_b + \theta_b(1 - r_b)\} = \psi_b, \text{ say}, \quad 1 \le b \le B.$$

Hence, the probability that Red survives the battle with $N_b$ type b Blues,
$1 \le b \le B$, is given by

$$\prod_{b=1}^{B} \psi_b^{N_b}$$

and the controller of Blue requires this to be less than or equal to 0.05, say. If there is
only one Blue type, then the choice is of the smallest number N to deploy such that

$$\psi^N \le 0.05.$$
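
The sizing calculation can be sketched as follows. This is an illustration under the assumption that a Blue wins or loses a continuous fight with the geometric-series probability $\psi_b = r_b/\{r_b + \theta_b(1-r_b)\}$; the function names and the example values are hypothetical.

```python
from typing import Sequence


def kill_prob(r: float, theta: float) -> float:
    """psi: probability that Red destroys a Blue in a fight to the finish."""
    return r / (r + theta * (1.0 - r))


def red_survival(counts: Sequence[int], r: Sequence[float],
                 theta: Sequence[float]) -> float:
    """Probability that Red destroys every deployed Blue."""
    prob = 1.0
    for n_b, rb, tb in zip(counts, r, theta):
        prob *= kill_prob(rb, tb) ** n_b
    return prob


def min_force(r: float, theta: float, target: float = 0.95) -> int:
    """Smallest number of identical Blues that kills Red with probability >= target."""
    psi = kill_prob(r, theta)
    if psi >= 1.0:
        raise ValueError("Red cannot be destroyed when theta is zero")
    n = 0
    while psi ** n > 1.0 - target:
        n += 1
    return n


# Hypothetical single-type example: psi = 2/3, so eight Blues are needed.
print(min_force(r=0.5, theta=0.5))
```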

If we now ask how the Blue force should accomplish the destruction of Red with
given probability at least cost to itself, then the strategy for Red does come into play
since, for example, Red may tend to engage "expensive" Blues first. Hence, we shall
suppose that Red shoots optimally, and will consider a simple situation for Blue in which
B = 2 and the loss of each type b costs him $C_b$, b = 1, 2. We note from the main report
that when B = 2, all indices are either increasing or decreasing in n. We shall suppose
that the former is the case for all Blues and so Red's optimal shooting policy engages
each Blue non-preemptively until one or other is destroyed.

Let the expected cost to the Blue force of the deployment of $N_b$ type b's, b = 1, 2,
against an optimally shooting Red be denoted $C(N_1, N_2)$. Blue's optimisation problem is

$$\min_{N_1, N_2} C(N_1, N_2) \quad \text{such that} \quad \psi_1^{N_1} \psi_2^{N_2} \le \varepsilon, \qquad (13)$$

where $1 > \varepsilon > 0$ and $1 - \varepsilon$ is the desired probability of killing Red.

We describe a scenario in which $C(N_1, N_2)$ may be computed easily. Red's priors for
the Blue types he faces are obtained by moderating his ignorance about them (as initially
expressed by p(Blue is of type b) = 0.5, b = 1, 2) by means of information obtained from a
sensor. This sensor can only judge Blue type with error. We have

$$p[\text{Blue judged to be of type } b_1 \mid \text{Blue is of type } b_2] = \phi_{b_1 b_2}.$$

Hence, Red allocates to each Blue one of two possible priors $\Pi^b$, b = 1, 2, according to
the judgement of the sensor. We have

$$\Pi^1_1 = p[\text{Blue is of type 1} \mid \text{Blue judged to be of type 1}] = \phi_{11}/(\phi_{11} + \phi_{12})$$

and similarly for the other probabilities. Let $X_1$ be a Bin($N_1$, $\phi_{11}$) random variable
representing the number of the $N_1$ type 1 Blues judged by the sensor to be of type 1
and hence given prior $\Pi^1$ by Red. Similarly, $X_2 \sim$ Bin($N_2$, $\phi_{22}$). Red faces
$X_1 + N_2 - X_2$ Blues to which he allocates prior $\Pi^1$ and initial index $G_1$, and
$N_1 - X_1 + X_2$ Blues to which he allocates prior $\Pi^2$ and initial index $G_2$. Suppose
$G_1 > G_2$ and so Red first engages all those Blues judged to be of type 1, followed by
those judged to be of type 2. We assume that if Red faces two or more Blues with the same
index then he chooses between them at random.

Now the cost $c(b_1, b_2)$ of engaging $b_1$ (fixed) type 1 Blues and $b_2$ (fixed) type 2
Blues in random order can be computed recursively, starting from $c(0,0) = 0$.

Hence, the desired expected cost to Blue of the chosen deployment is given by

$$C(N_1, N_2) = E\{c(X_1, N_2 - X_2) + \psi_1^{X_1} \psi_2^{N_2 - X_2}\, c(N_1 - X_1, X_2)\}.$$

This  can  now  be  used  in  (13). 
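
The expected cost can be evaluated numerically. The sketch below is hedged: the recursion used for $c(b_1, b_2)$ is an assumption consistent with the description (the next Blue is drawn uniformly from those remaining, Blue pays $C_b$ if that Blue is lost, and the battle ends if Red is destroyed), and the function name and argument names are hypothetical.

```python
from functools import lru_cache
from math import comb
from typing import Tuple


def expected_cost(n1: int, n2: int, psi: Tuple[float, float],
                  cost: Tuple[float, float], phi11: float, phi22: float) -> float:
    """Expected Blue loss C(N1, N2) when Red engages judged-type-1 Blues first."""

    @lru_cache(maxsize=None)
    def c(b1: int, b2: int) -> float:
        # Assumed recursion: pick the next Blue at random; with probability
        # psi_b it is destroyed (cost C_b) and the battle continues.
        if b1 + b2 == 0:
            return 0.0
        total = 0.0
        if b1:
            total += b1 * psi[0] * (cost[0] + c(b1 - 1, b2))
        if b2:
            total += b2 * psi[1] * (cost[1] + c(b1, b2 - 1))
        return total / (b1 + b2)

    def binom(n: int, k: int, p: float) -> float:
        return comb(n, k) * p ** k * (1.0 - p) ** (n - k)

    value = 0.0
    for x1 in range(n1 + 1):          # type 1 Blues judged to be of type 1
        for x2 in range(n2 + 1):      # type 2 Blues judged to be of type 2
            weight = binom(n1, x1, phi11) * binom(n2, x2, phi22)
            survive_first = psi[0] ** x1 * psi[1] ** (n2 - x2)
            value += weight * (c(x1, n2 - x2)
                               + survive_first * c(n1 - x1, x2))
    return value


print(expected_cost(3, 2, psi=(0.6, 0.3), cost=(2.0, 5.0), phi11=0.9, phi22=0.9))
```

Searching over small values of $N_1$ and $N_2$ that satisfy the survival constraint then gives a candidate solution to (13).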

In  more  complicated  situations,  Blue’s  expected  cost  may  be  computed  via  suitable 
development  of  the  methodologies  described  by  Bertsimas  and  Nino-Mora  (1996)  for 
multi-armed  bandits. 
Appendix B


Shoot-Look-Shoot  for  Red 

We  elaborate  the  scenario  described  in  Section  1  of  the  main  report  in  two  ways: 

(i) after every shot by Red, the targeted Blue is inspected and categorised (with
error) according to type and alive/dead. Write

$$p[\text{Blue judged to be of type } b_1 \mid \text{Blue is alive of type } b_2] = \phi_{b_1 b_2},$$

$$p[\text{Blue judged to be of type } b_1 \mid \text{Blue is dead of type } b_2] = \bar\phi_{b_1 b_2},$$

where $1 \le b_1, b_2 \le B$.

(ii) the Blue targeted by Red may or may not retaliate. We now have $\delta_b$ for the
probability of retaliation to a single shot for live Blues of type b. Dead Blues do
not fire back.

Inter  alia,  (ii)  enables  us  to  consider  the  deployment  of  decoys  by  Blue. 

Red now gathers information about the Blues he is facing in a much more
complicated way than previously. Index policies are still optimal, but the index structure
is more complex and simple closed forms as in (8) above cannot be expected. Consider
a Blue target with assigned prior $\Pi$. Sufficient statistics gleaned from the history of
Red's past engagements with this Blue, which will determine Red's posterior distribution
for this target, are:

(a) the number of Red shots faced by this target (n);

(b) the outcomes of the subsequent inspections ($\mathbf{b} = \{b_1, b_2, \ldots, b_n\}$);

(c) the number of retaliations by Blue (m);

(d) the shot by Red to which Blue last retaliated (k).

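Note that $m \le k \le n$. The posterior probability that, given n, $\mathbf{b}$, m, k, Blue is
of type b and is still alive is proportional to

$$\Pi_b (1-r_b)^n \delta_b^m (1-\theta_b)^m (1-\delta_b)^{n-m} \prod_{i=1}^{n} \phi_{b_i b} = \Pi_b P_b(n, \mathbf{b}, m, k) = \Pi_b P_b(H), \qquad (14)$$

where H is used as a shorthand for the history $(n, \mathbf{b}, m, k)$. The posterior
probability that, given n, $\mathbf{b}$, m, k, Blue is of type b but is now dead is
proportional to

$$\Pi_b (1-r_b)^k \delta_b^m (1-\theta_b)^m (1-\delta_b)^{k-m} r_b \sum_{t=0}^{n-k-1} (1-r_b)^t (1-\delta_b)^t \left(\prod_{i=1}^{k+t} \phi_{b_i b}\right) \left(\prod_{i=k+t+1}^{n} \bar\phi_{b_i b}\right) \qquad (15)$$

$$= \Pi_b \bar P_b(n, \mathbf{b}, m, k) = \Pi_b \bar P_b(H)$$

as before. Hence, given history H, the posterior probabilities are given by

$$p[\text{Blue alive of type } b \mid H] = \Pi_b P_b(H) \Big/ \sum_b \Pi_b \{P_b(H) + \bar P_b(H)\},$$

$$p[\text{Blue dead of type } b \mid H] = \Pi_b \bar P_b(H) \Big/ \sum_b \Pi_b \{P_b(H) + \bar P_b(H)\}.$$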

The corresponding one-step index for a Blue with prior $\Pi$ and history H is given by

$$\frac{\alpha \sum_b \Pi_b P_b(H) V_b r_b}{\sum_b \Pi_b \{P_b(H)[1 - \alpha\{r_b + (1-r_b)(1-\delta_b\theta_b)\}] + \bar P_b(H)(1-\alpha)\}},$$

and will frequently yield good shooting policies.

In order to develop the index $G(\Pi, H)$ for a Blue with prior $\Pi$ and history H that can
be used to determine optimal policies for Red, we require an iterative procedure due to
Glazebrook and Greatrix (1995). Denote by $\Omega(H)$ the set of histories reachable (in the
obvious sense) from history H and by $B\{\Omega(H)\}$ the set of bounded functions on
$\Omega(H)$.

If $H = (n, \mathbf{b}, m, k)$ there are two distinct ways in which the history can evolve
immediately from H, depending upon whether Blue retaliates or not during the next
engagement with Red. If Blue does retaliate we have an evolution of the form

$$H \to H(b, \text{ret}) = \{n+1, (\mathbf{b}, b), m+1, n+1\} \qquad (16)$$

on the assumption that neither party to the engagement is destroyed. To achieve the
transition in (16), Blue needs to be judged by the sensor to be of type b and also to
retaliate. If Blue does not retaliate, we have an evolution of the form

$$H \to H(b, \text{nonret}) = \{n+1, (\mathbf{b}, b), m, k\}. \qquad (17)$$

Let $u \in B\{\Omega(H)\}$ and $H' \in \Omega(H)$. From Glazebrook and Greatrix (1995) we need to
consider the transform $T_H: B\{\Omega(H)\} \to B\{\Omega(H)\}$ defined by

{T„(u)}(H')  =  max  -  - [r,  X,  (^’  nonret)} 


D{H') 


+(1  -  rj,  )(1  -  St,)Y,j  ^db  u{H'{d,  nonret)} 
+(1  -rb)5b{l- 6b <l>db  u{H'{d, ret)}] 


^  u{H'{d,nonret)]  ; 


^{\-rb){l-5b)Y,J^u{H{d,  nonret)} 

+(1  -  r,  )db  (1  -  Ob  )Y,  hb  u{H{d,  ret)}] 

«onrer)}  . 

In (18) we use the notations established in (16) and (17), together with

$$D(H) = \sum_b \Pi_b \{P_b(H) + \bar P_b(H)\}$$

with similar usage for $H' \in \Omega(H)$. We compute the index $G(\Pi, H)$ by noting that

$$\lim_{n \to \infty} \{T_H^n(u)\}(H) = G(\Pi, H)$$

for any $u \in B\{\Omega(H)\}$. Observe that here $T_H^n$ denotes an n-fold application of
$T_H$, i.e.,

$$T_H^n = T_H(T_H^{n-1}) = T_H(T_H(T_H^{n-2})) = \cdots.$$


Appendix  C 


Some  Other  Extensions 

The main report mentions some developments of the simple scenario of Section 1. In
(i)-(iii) below, we identify some further elaborations for which index policies remain
optimal. In (iv) we identify other possible extensions for which index policies will
perform well, while not always being strictly optimal.

(i) Each Blue type has a finite number of bullets (known to Red). This requires a
modest elaboration to the index structure and index policies remain optimal.

(ii) Red has a finite number of bullets. Here we have a "finite horizon" version of the
(potentially infinite) battle depicted in Section 1. The index policy based on
$H(\Pi, n)$ remains optimal for the case that these are all decreasing in n.

(iii) Here we elaborate the simple scenario in Section 1 by allowing all Blues that are
still alive to take a shot at Red (after each of Red's shots), and not simply that
Blue which was targeted. Suppose that Blue number j has a probability $\eta_j$ of
killing Red (irrespective of which Blue type he is) when he is not the Blue
targeted. Typically the $\eta$'s will be much smaller than the $\theta$'s. Under certain
plausible additional conditions, the index in (8) will be replaced by the following
for Blue number j:

n.(l  -  r,  )■  (1  -  «.  )■{(>  -  -  (1  -  } 

(iv) Other versions of the "finite horizon" problem in (ii) for which the indices are
not all decreasing are not strictly indexable, but index policies will usually
continue to do well. The same holds for a suggested development in which each
Blue would remain in the targeting zone for Red for just a finite amount of time
before leaving (having, for example, run out of fuel).


Gj(Il,n)  =  max 

rSl 
Appendix D


Simulation  Study 

This appendix reports on results from a simulation model implemented by
P.A. Jacobs. The scenario is as in Section 1 of the main report with Blue targets being of
two types. There are $b_1$ type 1 Blue targets and $b_2$ type 2 Blue targets. Red uses a
sensor to initially estimate the type of each Blue target. The probability that Red
classifies a type i target as type i is $\phi_i$; otherwise it is classified as the other type.
Natural priors for Red to use in this context are (see Appendix A):

(a) for those Blues judged to be of type 1:

$$\Pi^1_1 = \phi_{11}/(\phi_{11} + \phi_{12}) = 1 - \Pi^1_2;$$

(b) for those Blues judged to be of type 2:

$$\Pi^2_1 = \phi_{21}/(\phi_{21} + \phi_{22}) = 1 - \Pi^2_2.$$

The simulation model implements two shooting policies for Red: (i) an index policy
(as in Section 3 of the main report) with assigned values of $V_1 = V_2 = 1$, $\alpha = 1$; and
(ii) random shooting in which, at each decision epoch, Red chooses to engage one of the
remaining Blues chosen at random (with equal probabilities). Some results are presented
in Tables 1 and 2. In each cell of both tables we report the estimated mean number of
Blues killed prior to Red's destruction, with the corresponding standard error in brackets.
The upper figures in each cell correspond to the index policy and the lower to the random
shooting policy. In all runs we take $\phi_1 = \phi_2 = \phi$. All entries are based on
100 replications.
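
A simulation of this kind can be sketched as below. It is not P.A. Jacobs's model: it is a minimal illustration that assumes equal base rates for the two types (so a Blue judged to be of type 1 receives prior $(\phi, 1-\phi)$), uses the Table 1 parameter values, and computes the index as a truncated maximum over horizons under the ratio form assumed for (8) with $V_1 = V_2 = 1$ and $\alpha = 1$. All function names, the truncation `r_max`, the seed and the replication count are hypothetical choices.

```python
import random
from functools import lru_cache
from statistics import mean

R = (0.9, 0.1)       # single-shot kill probabilities r_1, r_2 (Table 1 setting)
THETA = (0.1, 0.9)   # retaliation kill probabilities theta_1, theta_2


@lru_cache(maxsize=None)
def index_G(prior, n, r_max=30):
    """Truncated index for state (prior, n) with V = 1 and alpha = 1 (assumed form)."""
    x = [(1 - rb) * (1 - tb) for rb, tb in zip(R, THETA)]
    w = [p * xb ** n for p, xb in zip(prior, x)]
    a1, a2, best = [0.0, 0.0], [1.0, 1.0], 0.0
    for _ in range(r_max):
        for b in range(2):
            a1[b] += a2[b]   # running sum of x_b^s
            a2[b] *= x[b]    # running x_b^horizon
        num = sum(wb * rb * a for wb, rb, a in zip(w, R, a1))
        den = sum(wb * (1 - rb * a - c) for wb, rb, a, c in zip(w, R, a1, a2))
        if den <= 0:
            return float("inf")
        best = max(best, num / den)
    return best


def simulate(b1, b2, phi, policy, rng):
    """Number of Blues killed before Red is destroyed (or no Blues remain)."""
    blues = []   # each entry: [true type, prior tuple, engagements survived]
    for true_type, count in ((0, b1), (1, b2)):
        for _ in range(count):
            judged = true_type if rng.random() < phi else 1 - true_type
            prior = (phi, 1 - phi) if judged == 0 else (1 - phi, phi)
            blues.append([true_type, prior, 0])
    kills = 0
    while blues:
        if policy == "index":
            j = max(range(len(blues)),
                    key=lambda i: index_G(blues[i][1], blues[i][2]))
        else:
            j = rng.randrange(len(blues))
        blue_type = blues[j][0]
        if rng.random() < R[blue_type]:        # Red's shot destroys the Blue
            kills += 1
            blues.pop(j)
        elif rng.random() < THETA[blue_type]:  # the retaliation destroys Red
            break
        else:
            blues[j][2] += 1                   # both survive; beliefs update
    return kills


def mean_kills(b1, b2, phi, policy, reps=200, seed=0):
    rng = random.Random(seed)
    return mean(simulate(b1, b2, phi, policy, rng) for _ in range(reps))


print(mean_kills(4, 6, 0.95, "index"), mean_kills(4, 6, 0.95, "random"))
```

With these parameter values the two types have the same mutual survival probability, so the posterior does not move with n and the index stays constant for each Blue; other parameter choices would exercise the learning step.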
TABLE 1

(Blue types very different: $r_1 = 0.9$, $\theta_1 = 0.1$; $r_2 = 0.1$, $\theta_2 = 0.9$)

Columns give the deployment $(b_1, b_2)$.

φ            Policy  (2,8)         (4,6)         (6,4)         (8,2)
[illegible]  (i)     2.04 (0.04)   3.98 (0.05)   5.65 (0.15)   7.82 (0.15)
             (ii)    0.39 (0.05)   0.83 (0.11)   1.41 (0.14)   3.42 (0.27)
0.95         (i)     1.66 (0.08)   3.36 (0.13)   5.25 (0.17)   7.28 (0.18)
             (ii)    0.34 (0.06)   0.72 (0.11)   1.33 (0.17)   2.74 (0.26)
[illegible]  (i)     1.35 (0.10)   2.82 (0.15)   4.60 (0.21)   6.46 (0.23)
             (ii)    0.33 (0.06)   0.59 (0.09)   1.59 (0.16)   3.03 (0.26)
[illegible]  (i)     1.19 (0.09)   2.29 (0.16)   4.20 (0.21)   6.13 (0.24)
             (ii)    0.39 (0.08)   0.81 (0.10)   1.32 (0.18)   3.27 (0.27)
[illegible]  (i)     [illegible]
             (ii)    [illegible]
[illegible]  (i)     2.88 (0.20), 4.28 (0.27); remaining entries illegible
             (ii)    1.28 (0.14), 2.70 (0.24); remaining entries illegible
[illegible]  (i)     [illegible]
             (ii)    [illegible]
[illegible]  (i)     0.30 (0.06)   0.86 (0.11)   1.49 (0.16)   2.80 (0.25)
             (ii)    0.30 (0.05)   0.81 (0.11)   1.52 (0.16)   2.56 (0.25)

The  mean  number  of  Blues  killed  by  Red  prior  to  Red’s  own  destruction 
under  (i)  an  index  policy  and  (ii)  a  random  shooting  policy. 


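TABLE 2

(Blue types more alike: $r_1 = 0.7$, $\theta_1 = 0.3$; $r_2 = 0.3$, $\theta_2 = 0.7$)

Columns give the deployment $(b_1, b_2)$.

φ            Policy  (2,8)         (4,6)         (6,4)         (8,2)
[illegible]  (i)     2.13 (0.11)   3.40 (0.19)   [illegible]   5.44 (0.33)
             (ii)    0.80 (0.11), 1.38 (0.19), 3.08 (0.33); one entry illegible, column positions uncertain
0.95         (i)     3.04 (0.20), 4.08 (0.26), 4.38 (0.32); one entry illegible, column positions uncertain
             (ii)    1.30 (0.16), 2.50 (0.28), 3.29 (0.29); one entry illegible, column positions uncertain
0.9          (i)     1.55 (0.14), 3.19 (0.22), 4.74 (0.33); one entry illegible, column positions uncertain
             (ii)    0.85 (0.11), 1.48 (0.19), 3.38 (0.30); one entry illegible, column positions uncertain
0.85         (i)     1.61 (0.14)   2.47 (0.21)   3.99 (0.28)   4.75 (0.32)
             (ii)    0.95 (0.16)   1.41 (0.15)   1.66 (0.19)   2.78 (0.26)
0.8          (i)     1.22 (0.14)   2.11 (0.18)   3.32 (0.25)   4.74 (0.32)
             (ii)    0.87 (0.11)   1.27 (0.17)   2.09 (0.21)   2.58 (0.28)
0.7          (i)     1.30 (0.14), 1.70 (0.18), 2.95 (0.26); one entry illegible, column positions uncertain
             (ii)    0.74 (0.12), 1.71 (0.21), 2.38 (0.24); one entry illegible, column positions uncertain
0.6          (i)     [illegible]   1.83 (0.20)   2.43 (0.21)   3.91 (0.31)
             (ii)    1.34 (0.16), 2.20 (0.24), 3.20 (0.29); one entry illegible, column positions uncertain
0.5          (i)     0.98 (0.13)   1.57 (0.20)   1.67 (0.20)   3.09 (0.29)
             (ii)    0.80 (0.13)   1.33 (0.17)   2.05 (0.21)   2.57 (0.29)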

The  mean  number  of  Blues  killed  by  Red  prior  to  Red’s  own  destruction 
under  (i)  an  index  policy  and  (ii)  a  random  shooting  policy. 
Although plainly a more extensive simulation study (with more replication) is
desirable, certain major features are already transparent from Tables 1 and 2. As we
might expect, the index policy outperforms the random shooting policy other than at
$\phi = 0.5$, where the sensor does no better than the flip of a fair coin and the two
policies are virtually identical. The level of excess number of Blues killed achieved by
the index policy is remarkably high when Red receives high quality information from the
sensor assets (i.e., $\phi$ is high). However, even rather mediocre information ($\phi = 0.6$,
say) can be put to very good use by Red. The value of the information to Red is
unsurprisingly greater when the Blue types are more distinct.